Tutorial on Correlations in R

Author

Arvind Venkatadri

Published

November 22, 2022

Modified

December 14, 2023

Abstract
Tests, Tables, and Graphs for Correlations in R

Introduction

We will create Tables for Correlations, and graphs for Correlations in R. As always, we will consistently use the Project Mosaic ecosystem of packages in R (mosaic, mosaicData and ggformula).

Setting up R Packages

```{r}
#| label: setup
#| message: true

library(tidyverse)
```
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
```{r}
#| label: setup
#| message: true

library(mosaic) # package for stats, simulations, and basic plots
```
Registered S3 method overwritten by 'mosaic':
  method                           from   
  fortify.SpatialPolygonsDataFrame ggplot2

The 'mosaic' package masks several functions from core packages in order to add 
additional features.  The original behavior of these functions should not be affected by this.

Attaching package: 'mosaic'

The following object is masked from 'package:Matrix':

    mean

The following objects are masked from 'package:dplyr':

    count, do, tally

The following object is masked from 'package:purrr':

    cross

The following object is masked from 'package:ggplot2':

    stat

The following objects are masked from 'package:stats':

    binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
    quantile, sd, t.test, var

The following objects are masked from 'package:base':

    max, mean, min, prod, range, sample, sum
```{r}
#| label: setup
#| message: true

library(ggformula) # package for professional looking plots, that use the formula interface from mosaic
library(skimr)
```

Attaching package: 'skimr'

The following object is masked from 'package:mosaic':

    n_missing
```{r}
#| label: setup
#| message: true

library(GGally)
```
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
```{r}
#| label: setup
#| message: true

library(corrplot) # For Correlogram plots
```
corrplot 0.92 loaded
```{r}
#| label: setup
#| message: true

library(broom) # to properly format stat test results

library(mosaicData) # package containing datasets
library(NHANES) # survey data collected by the US National Center for Health Statistics (NCHS)
```

All R functions seen in the code are clickable links that take you to online documentation about the function. Try!
The Formula interface

Note the standard method for all commands from the mosaic package:

goal( y ~ x | z, data = mydata, …)

With ggformula, one can create any graph/chart using:

gf_geometry(y ~ x | z, data = mydata)

OR

mydata %>% gf_geometry( y ~ x | z )

The second method may be preferable, especially if you have done some data manipulation first! More about this later!

Case Study 1: Galton Dataset from mosaicData

Let us inspect what datasets are available in the package mosaicData. Run this command in your Console:

```{r, eval=FALSE}
# Run in Console
data(package = "mosaicData")
```

The popup tab shows a lot of datasets we could use. Let us continue to use the famous Galton dataset and inspect it:

```{r}
data("Galton")
```

Inspecting the Data

The inspect command already gives us a series of statistical measures of different variables of interest. As discussed previously, we can retain the output of inspect and use it in our reports: (there are ways of dressing up these tables too)

```{r Galton-inspect}
galton_describe <- inspect(Galton)

galton_describe$categorical
galton_describe$quantitative
```

Try help("Galton") in your Console. The dataset is described as:

A data frame with 898 observations on the following variables.
- family a factor with levels for each family
- father the father’s height (in inches)
- mother the mother’s height (in inches)
- sex the child’s sex: F or M
- height the child’s height as an adult (in inches)
- nkids the number of adult children in the family, or, at least, the number whose heights Galton recorded.

There is a lot of Description generated by the mosaic::inspect() command ! Let us also look at the output of skim:

```{r}
#| column: body-outset-right
skimr::skim(Galton)
```
Data summary
Name Galton
Number of rows 898
Number of columns 6
_______________________
Column type frequency:
factor 2
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
family 0 1 FALSE 197 185: 15, 166: 11, 66: 11, 130: 10
sex 0 1 FALSE 2 M: 465, F: 433

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
father 0 1 69.23 2.47 62 68 69.0 71.0 78.5 ▁▅▇▂▁
mother 0 1 64.08 2.31 58 63 64.0 65.5 70.5 ▂▅▇▃▁
height 0 1 66.76 3.58 56 64 66.5 69.7 79.0 ▁▇▇▅▁
nkids 0 1 6.14 2.69 1 4 6.0 8.0 15.0 ▃▇▆▂▁

What can we say about the dataset and its variables? How big is the dataset? How many variables? What types are they, Quant or Qual? If they are Qual, what are the levels? Are they ordered levels? Which variables could have relationships with others? Why? Write down these Questions!

Correlations and Plots

What Questions might we have, that we could answer with a Statistical Measure, or Correlation chart?

Questions

How does children’s height correlate with that of father and mother? Is this relationship also affected by sex of the child?

With this question, height becomes our target variable, which we should always plot on the dependent y-axis.

```{r}
#| label: ggpairs-1
#| message: false

# Pulling out the list of Quant variables from NHANES
galton_quant <- galton_describe$quantitative
galton_quant$name

GGally::ggpairs(
  Galton,
  
  # Choose the variables we want to plot for
  columns = c("father", "mother", "height", "nkids"),
  
  switch = "both", # axis labels in more traditional locations
  progress = FALSE, # no compute progress messages needed
  
  # Choose the diagonal graphs (always single variable! Think!)
  diag = list(continuous = "barDiag"), # choosing histogram,not density
  
  # Choose lower triangle graphs, two-variable graphs
  lower = list(continuous = wrap("smooth", alpha = 0.1)),
  
  title = "Galton Data Correlations Plot"
) + 
  
  theme_bw()
```
[1] "father" "mother" "height" "nkids" 

We note that children’s height is correlated with that of father and mother. The correlations are both positive, and that with father seems to be the larger of the two. ( Look at the slopes of the lines and the values of the correlation scores. )

Question

What if we group the Quant variables based on a Qual variable, like sex of the child?

```{r}
#| label: ggpairs-2 with groups
#| message: false

# Pulling out the list of Quant variables from NHANES
galton_quant <- galton_describe$quantitative
galton_quant$name

GGally::ggpairs(
  Galton,
  
  mapping = aes(colour = sex), # Colour by `sex`

  # Choose the variables we want to plot for
  columns = c("father", "mother", "height", "nkids"),
  switch = "both", # axis labels in more traditional locations
  progress = FALSE, # no compute progress messages needed
  
  diag = list(continuous = "barDiag"),
  
  # Choose lower triangle graphs, two-variable graphs
  lower = list(continuous = wrap("smooth", alpha = 0.1)),
  
  title = "Galton Data Correlations Plot"
) + 
  
  theme_bw()
```
[1] "father" "mother" "height" "nkids" 

The split scatter plots are useful, as is the split histogram for height: Clearly the correlation of children’s height with father and mother is positive for both sex-es. The other plots, and even some of the correlations scores are not all useful! Just shows everything we can compute is not necessarily useful immediately.

In later modules we will see how to plot correlations when the number of variables is larger still.

Question

Can we plot a Correlogram for this dataset?

```{r}
#library(corrplot)

galton_num_var <- Galton %>% select(father, mother, height, nkids)
galton_cor <- cor(galton_num_var)
galton_cor %>%
  corrplot(method = "ellipse",
           type = "lower",
           main = "Correlogram for Galton dataset")
```

Clearly height is positively correlated to father and mother; interestingly, height is negatively correlated ( slightly) with nkids.

Question

Let us confirm with a correlation test:

We will use the mosaic function cor_test to get these results:

```{r}
#| column: body-outset-right
mosaic::cor_test(height ~ father, data = Galton) %>% 
  broom::tidy() %>% 
  knitr::kable(digits = 2,
               caption = "Children vs Fathers")
```
Children vs Fathers
estimate statistic p.value parameter conf.low conf.high method alternative
0.28 8.57 0 896 0.21 0.33 Pearson’s product-moment correlation two.sided
```{r}
#| column: body-outset-right
mosaic::cor_test(height ~ mother, data = Galton) %>% 
  broom::tidy() %>% 
    knitr::kable(digits = 2,
               caption = "Children vs Mothers")
```
Children vs Mothers
estimate statistic p.value parameter conf.low conf.high method alternative
0.2 6.16 0 896 0.14 0.26 Pearson’s product-moment correlation two.sided
Correlation Scores and Uncertainty

Note how the mosaic::cor_test() reports a correlation score estimate and the p-value for the same. There is also a confidence interval reported for the correlation score, an interval within which we are 95% sure that the true correlation value is to be found.

Note that GGally::ggpairs() too reports the significance of the correlation scores estimates using *** or **. This indicates the p-value in the scores obtained by GGally; Presumably, there is an internal cor_test that is run for each pair of variables and the p-value and confidence levels are also computed internally.

In both cases, we used the formula \(height \sim other-variable\), in keeping with our idea of height being the dependent, target variable..

We also see the p.value for the estimateed correlation is negligible, and the conf.low/conf.high interval does not straddle \(0\). These attest to the significance of the correlation score.

Question

What does this correlation look when split by sex of Child?

```{r}
#| column: body-outset-right
#| results: hold
# For the sons

mosaic::cor_test(height ~ father,
                 data = Galton %>% filter(sex == "M")) %>% 
  broom::tidy() %>% knitr::kable(digits = 2,
                                 caption = "Sons vs Fathers")
cor_test(height ~ mother, 
         data = Galton %>% filter(sex == "M")) %>% 
  broom::tidy() %>% knitr::kable(digits = 2,
                                 caption = "Sons vs Mothers")

# For the daughters
cor_test(height ~ father, 
         data = Galton %>% filter(sex == "F")) %>% 
  broom::tidy() %>% knitr::kable(digits = 2,
                                 caption = "Daughters vs Fathers")
cor_test(height ~ mother, 
         data = Galton %>% filter(sex == "F")) %>% 
  broom::tidy() %>% knitr::kable(digits = 2,
                                 caption = "Daughters vs Mothers")
```
Sons vs Fathers
estimate statistic p.value parameter conf.low conf.high method alternative
0.39 9.15 0 463 0.31 0.47 Pearson’s product-moment correlation two.sided
Sons vs Mothers
estimate statistic p.value parameter conf.low conf.high method alternative
0.33 7.63 0 463 0.25 0.41 Pearson’s product-moment correlation two.sided
Daughters vs Fathers
estimate statistic p.value parameter conf.low conf.high method alternative
0.46 10.72 0 431 0.38 0.53 Pearson’s product-moment correlation two.sided
Daughters vs Mothers
estimate statistic p.value parameter conf.low conf.high method alternative
0.31 6.86 0 431 0.23 0.4 Pearson’s product-moment correlation two.sided

The same observation as made above ( p.value and confidence intervals) applies here too and tells us that the estimated correlations are significant.

Visualizing Uncertainty in Correlation Estimates

We can also visualize this uncertainty and the confidence levels in a plot too, using gf_errorbar and a handy set of functions within purrr which is part of the tidyverse. Assuming heights is the target variable we want to correlate every other (quantitative) variable against, we can proceed very quickly as follows: we will first plot correlation uncertainty for one pair of variables to develop the intuition, and then for all variables against the one target variable:

```{r}
#| label: Single Correlation Uncertainty
mosaic::cor_test(height ~ mother, data = Galton) %>% 
  broom::tidy() %>% 

# We need a graph not a table 
# So comment out this line from the earlier code
#knitr::kable(digits = 2,caption = "Children vs Mothers")

rowid_to_column(var = "index") %>% # Need an index to plot with
  
  # Uncertainty as error-bars
  gf_errorbar(conf.high + conf.low ~ index, linewidth = 2) %>% 
  
  # Estimate as a point
  gf_point(estimate ~ index, color = "red", size = 6) %>% 
  
  # Labels
  gf_text(estimate ~ index - 0.2, 
             label = "Correlation Score = estimate") %>% 
  gf_text(conf.high*0.98 ~ index - 0.25, 
           label = "Upper Limit = estimate + conf.high") %>%   
  gf_text(conf.low*1.04 ~ index - 0.25, 
           label = "Lower Limit = estimate - conf.low") %>% 
  gf_theme(theme_bw())
```

We can now do this for all variables against the target variable height, which we identified in our research question. We will use the iteration capabilities offered by the tidyverse package, purrr:

```{r}
#| label: corrtest plots

all_corrs <- Galton %>% 
  select(where(is.numeric)) %>% 
  
  # leave off height to get all the remaining ones
  select(- height) %>%  
  
  # perform a cor.test for all variables against height
  purrr::map(.x = .,
             .f = \(x) cor.test(x, Galton$height)) %>%
  
  # tidy up the cor.test outputs into a tidy data frame
  map_dfr(broom::tidy, .id = "predictor") 

all_corrs

all_corrs %>% 
  
  # arrange the predictors in order of their correlation scores
  # with the target variable (`height`)
  # Add errorbars to show uncertainty ranges / confidence intervals
  # Use errorbar width and linewidth fo emphasis
  gf_errorbar(conf.high + conf.low ~ reorder(predictor, estimate),
              color = ~ estimate,
              width = 0.2,
              linewidth = ~ -log10(p.value)) %>% 
  
  # All correlation estimates as points
  gf_point(estimate ~ reorder(predictor, estimate), 
           color = "black") %>% 
  
  # Reference line at zero correlation score
  gf_hline(yintercept = 0, color = "grey", linewidth = 2) %>% 
  
  # Themes,Titles, and Scales
  gf_labs(x = NULL, y = "Correlation with height in Galton", 
          caption = "Significance = - log10(p.value)") %>% 
  
  gf_refine(
    
    # Scale for colour
 scale_colour_distiller("Correlation", type = "div", palette = "RdBu"),
            
    # Scale for dumbbells!!
    scale_linewidth_continuous("significance",
                                       range = c(0.5,4))) %>% 
  gf_refine(guides(linewidth = guide_legend(reverse = TRUE))) %>%
  gf_theme(theme_classic())
```

We can clearly see the size of the correlations and the confidence intervals marked in this plot. father has somewhat greater correlation with children’s height, as compared to mother. nkids seems to matter very little. This kind of plot will be very useful when we pursue linear regression models.

Question

How can we show this correlation in a set of Scatter Plots + Regression Lines? Can we recreate Galton’s famous diagram?

```{r}
#| layout-ncol: 2
# For the sons
gf_point(height ~ father, 
         data = Galton %>% filter(sex == "M"),
         title = "Soms and Fathers") %>%
  gf_smooth(method = "lm") %>%
  gf_theme(theme_minimal())

gf_point(height ~ mother, 
         data = Galton %>% filter(sex == "M"),
         title = "Sons and Mothers") %>%
  gf_smooth(method = "lm") %>%
  gf_theme(theme_minimal())

# For the daughters
gf_point(height ~ father, 
         data = Galton %>% filter(sex == "F"),
         title = "Daughters and Fathers") %>%
  gf_smooth(method = "lm") %>%
  gf_theme(theme_minimal())

gf_point(height ~ mother, 
         data = Galton %>% filter(sex == "F"),
         title = "Daughters and Mothers") %>%
  gf_smooth(method = "lm") %>% 
  gf_theme(theme_minimal())
```

An approximation to Galton’s famous plot1 (see Wikipedia):

```{r}
#| layout-ncol: 2
gf_point(height ~ (father + mother)/2, data = Galton) %>% 
  gf_smooth(method = "lm") %>% 
  gf_density_2d(n = 8) %>% 
  gf_abline(slope = 1) %>% 
  gf_theme(theme_minimal())

gf_point(height ~ (father + mother)/2, data = Galton) %>% 
  gf_smooth(method = "lm") %>% 
  gf_ellipse(level = 0.95, color = "red") %>% 
    gf_ellipse(level = 0.75, color = "blue") %>% 
    gf_ellipse(level = 0.5, color = "green") %>% 
  gf_abline(slope = 1) %>% 
  gf_theme(theme_minimal())
```

How would you interpret this plot2?

Case Study#2: Dataset from NHANES

Let us look at the NHANES dataset from the package NHANES:

```{r}
data("NHANES")
```

Inspecting the Data

```{r}
#| label:  NHANES-inspect
#| layout-ncol: 3
#| column: body-outset-right

NHANES_describe <- inspect(NHANES)

NHANES_describe$categorical
NHANES_describe$quantitative
NHANES
skimr::skim(NHANES)
```
Data summary
Name NHANES
Number of rows 10000
Number of columns 76
_______________________
Column type frequency:
factor 31
numeric 45
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
SurveyYr 0 1.00 FALSE 2 200: 5000, 201: 5000
Gender 0 1.00 FALSE 2 fem: 5020, mal: 4980
AgeDecade 333 0.97 FALSE 8 40: 1398, 0-: 1391, 10: 1374, 20: 1356
Race1 0 1.00 FALSE 5 Whi: 6372, Bla: 1197, Mex: 1015, Oth: 806
Race3 5000 0.50 FALSE 6 Whi: 3135, Bla: 589, Mex: 480, His: 350
Education 2779 0.72 FALSE 5 Som: 2267, Col: 2098, Hig: 1517, 9 -: 888
MaritalStatus 2769 0.72 FALSE 6 Mar: 3945, Nev: 1380, Div: 707, Liv: 560
HHIncome 811 0.92 FALSE 12 mor: 2220, 750: 1084, 250: 958, 350: 863
HomeOwn 63 0.99 FALSE 3 Own: 6425, Ren: 3287, Oth: 225
Work 2229 0.78 FALSE 3 Wor: 4613, Not: 2847, Loo: 311
BMICatUnder20yrs 8726 0.13 FALSE 4 Nor: 805, Obe: 221, Ove: 193, Und: 55
BMI_WHO 397 0.96 FALSE 4 18.: 2911, 30.: 2751, 25.: 2664, 12.: 1277
Diabetes 142 0.99 FALSE 2 No: 9098, Yes: 760
HealthGen 2461 0.75 FALSE 5 Goo: 2956, Vgo: 2508, Fai: 1010, Exc: 878
LittleInterest 3333 0.67 FALSE 3 Non: 5103, Sev: 1130, Mos: 434
Depressed 3327 0.67 FALSE 3 Non: 5246, Sev: 1009, Mos: 418
SleepTrouble 2228 0.78 FALSE 2 No: 5799, Yes: 1973
PhysActive 1674 0.83 FALSE 2 Yes: 4649, No: 3677
TVHrsDay 5141 0.49 FALSE 7 2_h: 1275, 1_h: 884, 3_h: 836, 0_t: 638
CompHrsDay 5137 0.49 FALSE 7 0_t: 1409, 0_h: 1073, 1_h: 1030, 2_h: 589
Alcohol12PlusYr 3420 0.66 FALSE 2 Yes: 5212, No: 1368
SmokeNow 6789 0.32 FALSE 2 No: 1745, Yes: 1466
Smoke100 2765 0.72 FALSE 2 No: 4024, Yes: 3211
Smoke100n 2765 0.72 FALSE 2 Non: 4024, Smo: 3211
Marijuana 5059 0.49 FALSE 2 Yes: 2892, No: 2049
RegularMarij 5059 0.49 FALSE 2 No: 3575, Yes: 1366
HardDrugs 4235 0.58 FALSE 2 No: 4700, Yes: 1065
SexEver 4233 0.58 FALSE 2 Yes: 5544, No: 223
SameSex 4232 0.58 FALSE 2 No: 5353, Yes: 415
SexOrientation 5158 0.48 FALSE 3 Het: 4638, Bis: 119, Hom: 85
PregnantNow 8304 0.17 FALSE 3 No: 1573, Yes: 72, Unk: 51

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1.00 61944.64 5871.17 51624.00 56904.50 62159.50 67039.00 71915.00 ▇▇▇▇▇
Age 0 1.00 36.74 22.40 0.00 17.00 36.00 54.00 80.00 ▇▇▇▆▅
AgeMonths 5038 0.50 420.12 259.04 0.00 199.00 418.00 624.00 959.00 ▇▇▇▆▃
HHIncomeMid 811 0.92 57206.17 33020.28 2500.00 30000.00 50000.00 87500.00 100000.00 ▃▆▃▁▇
Poverty 726 0.93 2.80 1.68 0.00 1.24 2.70 4.71 5.00 ▅▅▃▃▇
HomeRooms 69 0.99 6.25 2.28 1.00 5.00 6.00 8.00 13.00 ▂▆▇▂▁
Weight 78 0.99 70.98 29.13 2.80 56.10 72.70 88.90 230.70 ▂▇▂▁▁
Length 9457 0.05 85.02 13.71 47.10 75.70 87.00 96.10 112.20 ▁▃▆▇▃
HeadCirc 9912 0.01 41.18 2.31 34.20 39.58 41.45 42.92 45.40 ▁▂▇▇▅
Height 353 0.96 161.88 20.19 83.60 156.80 166.00 174.50 200.40 ▁▁▁▇▂
BMI 366 0.96 26.66 7.38 12.88 21.58 25.98 30.89 81.25 ▇▆▁▁▁
Pulse 1437 0.86 73.56 12.16 40.00 64.00 72.00 82.00 136.00 ▂▇▃▁▁
BPSysAve 1449 0.86 118.15 17.25 76.00 106.00 116.00 127.00 226.00 ▃▇▂▁▁
BPDiaAve 1449 0.86 67.48 14.35 0.00 61.00 69.00 76.00 116.00 ▁▁▇▇▁
BPSys1 1763 0.82 119.09 17.50 72.00 106.00 116.00 128.00 232.00 ▂▇▂▁▁
BPDia1 1763 0.82 68.28 13.78 0.00 62.00 70.00 76.00 118.00 ▁▁▇▆▁
BPSys2 1647 0.84 118.48 17.49 76.00 106.00 116.00 128.00 226.00 ▃▇▂▁▁
BPDia2 1647 0.84 67.66 14.42 0.00 60.00 68.00 76.00 118.00 ▁▁▇▆▁
BPSys3 1635 0.84 117.93 17.18 76.00 106.00 116.00 126.00 226.00 ▃▇▂▁▁
BPDia3 1635 0.84 67.30 14.96 0.00 60.00 68.00 76.00 116.00 ▁▁▇▇▁
Testosterone 5874 0.41 197.90 226.50 0.25 17.70 43.82 362.41 1795.60 ▇▂▁▁▁
DirectChol 1526 0.85 1.36 0.40 0.39 1.09 1.29 1.58 4.03 ▅▇▂▁▁
TotChol 1526 0.85 4.88 1.08 1.53 4.11 4.78 5.53 13.65 ▂▇▁▁▁
UrineVol1 987 0.90 118.52 90.34 0.00 50.00 94.00 164.00 510.00 ▇▅▂▁▁
UrineFlow1 1603 0.84 0.98 0.95 0.00 0.40 0.70 1.22 17.17 ▇▁▁▁▁
UrineVol2 8522 0.15 119.68 90.16 0.00 52.00 95.00 171.75 409.00 ▇▆▃▂▁
UrineFlow2 8524 0.15 1.15 1.07 0.00 0.48 0.76 1.51 13.69 ▇▁▁▁▁
DiabetesAge 9371 0.06 48.42 15.68 1.00 40.00 50.00 58.00 80.00 ▁▂▆▇▂
DaysPhysHlthBad 2468 0.75 3.33 7.40 0.00 0.00 0.00 3.00 30.00 ▇▁▁▁▁
DaysMentHlthBad 2466 0.75 4.13 7.83 0.00 0.00 0.00 4.00 30.00 ▇▁▁▁▁
nPregnancies 7396 0.26 3.03 1.80 1.00 2.00 3.00 4.00 32.00 ▇▁▁▁▁
nBabies 7584 0.24 2.46 1.32 0.00 2.00 2.00 3.00 12.00 ▇▅▁▁▁
Age1stBaby 8116 0.19 22.65 4.77 14.00 19.00 22.00 26.00 39.00 ▆▇▅▂▁
SleepHrsNight 2245 0.78 6.93 1.35 2.00 6.00 7.00 8.00 12.00 ▁▅▇▁▁
PhysActiveDays 5337 0.47 3.74 1.84 1.00 2.00 3.00 5.00 7.00 ▇▇▃▅▅
TVHrsDayChild 9347 0.07 1.94 1.43 0.00 1.00 2.00 3.00 6.00 ▇▆▂▂▂
CompHrsDayChild 9347 0.07 2.20 2.52 0.00 0.00 1.00 6.00 6.00 ▇▁▁▁▃
AlcoholDay 5086 0.49 2.91 3.18 1.00 1.00 2.00 3.00 82.00 ▇▁▁▁▁
AlcoholYear 4078 0.59 75.10 103.03 0.00 3.00 24.00 104.00 364.00 ▇▁▁▁▁
SmokeAge 6920 0.31 17.83 5.33 6.00 15.00 17.00 19.00 72.00 ▇▂▁▁▁
AgeFirstMarij 7109 0.29 17.02 3.90 1.00 15.00 16.00 19.00 48.00 ▁▇▂▁▁
AgeRegMarij 8634 0.14 17.69 4.81 5.00 15.00 17.00 19.00 52.00 ▂▇▁▁▁
SexAge 4460 0.55 17.43 3.72 9.00 15.00 17.00 19.00 50.00 ▇▅▁▁▁
SexNumPartnLife 4275 0.57 15.09 57.85 0.00 2.00 5.00 12.00 2000.00 ▇▁▁▁▁
SexNumPartYear 5072 0.49 1.34 2.78 0.00 1.00 1.00 1.00 69.00 ▇▁▁▁▁

Try help("NHANES") in your Console.

This is survey data collected by the US National Center for Health Statistics (NCHS) which has conducted a series of health and nutrition surveys since the early 1960’s. Since 1999 approximately 5,000 individuals of all ages are interviewed in their homes every year and complete the health examination component of the survey. The health examination is conducted in a mobile examination centre (MEC).

The dataset is described as: A data frame with 100000 observations on 76 variables. Some of these are:
- Race1 and Race2: factors with 5 and 6 levels respectively
- Education a factor with 5 levels
- HHIncomeMid Total annual gross income for the household in US dollars.
- Age
- BMI: Body mass index (weight/height2 in kg/m2)
- Height: Standing height in cm.
- Weight: Weight in kg > > - Testosterone: Testosterone total (ng/dL) - PhysActiveDays: Number of days in a typical week that participant does moderate or vigorous-intensity activity.
- CompHrsDay: Number of hours per day on average participant used a computer or gaming device over the past 30 days.

Missing Data

Why do so many of the variables have missing entries? What could be your guess about the Experiment/Survey`?

Let us make some counts of the data, since we have so many factors:

```{r}
NHANES %>% count(Gender)
NHANES %>% count(Race1)
NHANES %>% count(Race3)
NHANES %>% count(Education)
NHANES %>% count(MaritalStatus)
```

There is a good mix of factors and counts.

Now we articulate our Research Questions:

Research Questions
  1. Does Testosterone have a relationship with parameters such as BMI, Weight, Height, PhysActiveDays CompHrsDay and Age?

  2. Does HHIncomeMid have a relationship with these same parameters? And with Gender?

  3. Are there any other pairwise correlations that we should note? (This is especially useful in choosing independent variables for multiple regression)

( Yes we are concerned with men more than with the women, sadly.)

Correlations and Plots

```{r}
#| message: false
#| warning: false
#| label: "Household Median Income"

GGally::ggpairs(NHANES, 
                # Choose the variables we want to plot for
                columns = c("HHIncomeMid", "Weight", "Height", 
                            "BMI", "Gender"), 
                
                # LISTs of graphs needed at different locations
                # For different combinations of variables 
                diag = list(continuous = "barDiag"),
                lower = list(continuous = wrap("smooth", alpha = 0.01)),
                upper = list(continuous = "cor"),
                
                switch = "both", # axis labels in more traditional locations
                progress = FALSE ) + # No compute progress bars needed
  theme_bw()
```

We see that HHIncomeMid is Quantitative, discrete valued variable, since it is based on a set of median incomes for different ranges of income. BMI, Weight, Height are continuous Quant variables.

HHIncomeMid also seems to be relatively unaffected by Weight; And is only mildly correlated with Height and BMI, as seen both by the correlation score magnitudes and the slopes of the trend lines.

There is a difference in the median income by Gender, but we will defer that kind of test for later, when we do Statistical Inference.

Unsurprisingly, BMI and Weight have a strong relationship, as do Height and Weight; the latter is of course non-linear, since the Height levels off at a point.

```{r}
#| label: Testosterone
#| message: false
#| warning: false

GGally::ggpairs(NHANES, 
                columns = c("Testosterone", "Weight", "Height", "BMI"), 
                
                diag = list(continuous = "barDiag"),
                lower = list(continuous = wrap("smooth", alpha = 0.01)),
                upper = list(continuous = "cor"),
                
                switch = "both",
                progress = FALSE ) +
  theme_bw()
```

It is clear that Testosterone has strong relationships with Height and Weight but not so much with BMI.

Visualizing Uncertainty in Correlation Estimates

Since the pairs plot is fairly clear for both target variables, let us head to visualizing the significance and uncertainty in the correlation estimates.

```{r}
#| label: NHANES errorbars-1

HHIncome_corrs <- NHANES %>% 
  select(where(is.numeric)) %>% 
  
  # leave off height to get all the remaining ones
  select(- HHIncomeMid) %>%  
  
  # perform a cor.test for all variables against height
  purrr::map(.x = .,
             .f = \(x) cor.test(x, NHANES$HHIncomeMid)) %>%
  
  # tidy up the cor.test outputs into a tidy data frame
  map_dfr(broom::tidy, .id = "predictor") 

HHIncome_corrs

HHIncome_corrs %>% 
  
  # Reference line at zero correlation score
  gf_hline(yintercept = 0, color = "grey", linewidth = 2) %>% 
  
  # arrange the predictors in order of their correlation scores
  # with the target variable (`height`)
  # Add errorbars to show uncertainty ranges / confidence intervals
  # Use errorbar width and linewidth fo emphasis
  gf_errorbar(conf.high + conf.low ~ reorder(predictor, estimate),
              color = ~ estimate,
              width = 0.2,
              linewidth = ~ -log10(p.value + 0.001)) %>% 
  
  # All correlation estimates as points
  gf_point(estimate ~ reorder(predictor, estimate), 
           color = "black") %>% 
  
  # Themes,Titles, and Scales
  gf_labs(x = NULL, y = "Correlations with HouseHold Median Income", 
          caption = "Significance = - log10(p.value)") %>% 
  gf_theme(theme_classic()) %>%

  
  # Scale for colour
  gf_refine(guides(linewidth = guide_legend(reverse = TRUE)),
            scale_colour_distiller("Correlation", type = "div", 
                                    palette = "RdBu"),
            
  # Scale for dumbbells!!
  scale_linewidth_continuous("Significance", range = c(0.05,2)),
  
  theme(axis.text.y = element_text(size = 6, hjust = 1)),
  coord_flip()) 
```

If we select just the variables from our Research Question:

```{r}
#| label: NHANES errorbars-2

HHIncome_corrs_select <- NHANES %>% 
  select(Height, Weight, BMI) %>% # Only change is here!
  
  # leave off height to get all the remaining ones
  #select(- HHIncomeMid) %>%  
  
  # perform a cor.test for all variables against height
  purrr::map(.x = .,
             .f = \(x) cor.test(x, NHANES$HHIncomeMid)) %>%
  
  # tidy up the cor.test outputs into a tidy data frame
  map_dfr(broom::tidy, .id = "predictor") 

HHIncome_corrs_select

HHIncome_corrs_select %>% 
  
  # arrange the predictors in order of their correlation scores
  # with the target variable (`height`)
  # Add errorbars to show uncertainty ranges / confidence intervals
  # Use errorbar width and linewidth fo emphasis
  gf_errorbar(conf.high + conf.low ~ reorder(predictor, estimate),
              color = ~ estimate,
              width = 0.2,
              linewidth = ~ -log10(p.value + 0.000001)) %>% 
  
  # All correlation estimates as points
  gf_point(estimate ~ reorder(predictor, estimate), 
           color = "black") %>% 
  
  # Reference line at zero correlation score
  gf_hline(yintercept = 0, color = "grey", linewidth = 2) %>% 
  
  # Themes,Titles, and Scales
  gf_labs(x = NULL, y = "Correlations with HouseHold Median Income", 
          caption = "Significance = - log10(p.value + 0.000001)") %>% 
  
  gf_theme(theme_classic()) %>%

  
  # Scale for colour
  gf_refine(guides(linewidth = guide_legend(reverse = TRUE)),
            scale_colour_distiller("Correlation", type = "div", 
                                    palette = "RdBu"),
            
  # Scale for dumbbells!!
  scale_linewidth_continuous("Significance", range = c(0.05,2)),
  
  theme(axis.text.y = element_text(size = 8, hjust = 1)),
  coord_flip()) 
```

So we might say taller people make more money? And fatter people make slightly less money? Well, the magnitude of the correlations (aka effect size) are low so we would not imagine this to be a hypothesis that we can defend.

Let us look at the Testosterone variable: trying all variables shows some paucity of observations ( due to missing data), so we will stick with our chosen variables:

```{r}
#| label: testosterone errorbars

Testosterone_corrs <- NHANES %>%
  select(Height, Weight, BMI) %>%
  
  # leave off height to get all the remaining ones
  #select(- Testosterone) %>%
  
  # perform a cor.test for all variables against height
  purrr::map(.x = .,
             .f = \(x) cor.test(x, NHANES$Testosterone)) %>%
  
  # tidy up the cor.test outputs into a tidy data frame
  map_dfr(broom::tidy, .id = "predictor")

Testosterone_corrs

Testosterone_corrs %>%
  
  # Reference line at zero correlation score
  gf_hline(yintercept = 0,
           color = "grey",
           linewidth = 2) %>%
  
  # arrange the predictors in order of their correlation scores
  # with the target variable (`height`)
  # Add errorbars to show uncertainty ranges / confidence intervals
  # Use errorbar width and linewidth fo emphasis
  gf_errorbar(
    conf.high + conf.low ~ reorder(predictor, estimate),
    color = ~ estimate,
    width = 0.2,
    linewidth = ~ -log10(p.value + 0.000001)
  ) %>%
  
  # All correlation estimates as points
  gf_point(estimate ~ reorder(predictor, estimate),
           color = "black") %>%
  
  
  # Themes,Titles, and Scales
  gf_labs(x = NULL, y = "Correlations with Testosterone Levels",
          caption = "Significance = - log10(p.value + 0.000001)") %>%
  
  gf_theme(theme_classic()) %>%
  
  
  # Scale for colour
  gf_refine(
    guides(linewidth = guide_legend(reverse = TRUE)),
    scale_colour_distiller("Correlation", type = "div",
                           palette = "RdBu"),
    
    # Scale for dumbbells!!
    scale_linewidth_continuous("Significance", range = c(0.05, 2)),
    
    theme(axis.text.y = element_text(size = 8, hjust = 1)),
    coord_flip()
  ) 
```

Conclusion

We have a decent Correlations related workflow in R:

  • Load the dataset
  • inspect/skim/glimpse the dataset, identify Quant and Qual variables
  • Identify a target variable based on your knowledge of the data, how it was gathered, who gathered it and what was their intent
  • Develop Pair-Wise plots + Correlations using GGally::ggpairs()
  • Develop Correlogram corrplot::corrplot
  • Check everything with a cor_test: effect size,significance, confidence intervals
  • Use purrr + cor.test to plot correlations and confidence intervals for multiple Quant predictor variables against the target variable
  • Plot scatter plots using gf_point.
  • Add extra lines using gf_abline() to compare hypotheses that you may have.

References

R Package Citations
Package Version Citation
corrplot 0.92 Wei and Simko (2021)
GGally 2.2.0 Schloerke et al. (2023)
ggformula 0.12.0 Kaplan and Pruim (2023)
mosaic 1.9.0 Pruim, Kaplan, and Horton (2017)
mosaicData 0.20.4 Pruim, Kaplan, and Horton (2023)
NHANES 2.1.0 Pruim (2015)
TeachHist 0.2.1 Lange (2023)
TeachingDemos 2.12 Snow (2020)
Kaplan, Daniel, and Randall Pruim. 2023. ggformula: Formula Interface to the Grammar of Graphics. https://CRAN.R-project.org/package=ggformula.
Lange, Carsten. 2023. TeachHist: A Collection of Amended Histograms Designed for Teaching Statistics. https://CRAN.R-project.org/package=TeachHist.
Pruim, Randall. 2015. NHANES: Data from the US National Health and Nutrition Examination Study. https://CRAN.R-project.org/package=NHANES.
Pruim, Randall, Daniel T Kaplan, and Nicholas J Horton. 2017. “The Mosaic Package: Helping Students to Think with Data Using r.” The R Journal 9 (1): 77–102. https://journal.r-project.org/archive/2017/RJ-2017-024/index.html.
Pruim, Randall, Daniel Kaplan, and Nicholas Horton. 2023. mosaicData: Project MOSAIC Data Sets. https://CRAN.R-project.org/package=mosaicData.
Schloerke, Barret, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Jason Crowley. 2023. GGally: Extension to ggplot2. https://CRAN.R-project.org/package=GGally.
Snow, Greg. 2020. TeachingDemos: Demonstrations for Teaching and Learning. https://CRAN.R-project.org/package=TeachingDemos.
Wei, Taiyun, and Viliam Simko. 2021. R Package corrplot: Visualization of a Correlation Matrix. https://github.com/taiyun/corrplot.
Back to top

Footnotes

  1. http://euclid.psych.yorku.ca/SCS/Gallery/images/galton-corr.jpg>↩︎

  2. https://www.researchgate.net/figure/Galtons-smoothed-correlation-diagram-for-the-data-on-heights-of-parents-and-children_fig15_226400313↩︎

Citation

BibTeX citation:
@online{venkatadri2022,
  author = {Venkatadri, Arvind},
  title = {Tutorial on {Correlations} in {R}},
  date = {2022-11-22},
  url = {https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/30-Correlations/files/correlations.html},
  langid = {en},
  abstract = {Tests, Tables, and Graphs for Correlations in R}
}
For attribution, please cite this work as:
Venkatadri, Arvind. 2022. “Tutorial on Correlations in R.” November 22, 2022. https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/30-Correlations/files/correlations.html.